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Abstract 



Wc study quasi-Newton methods from the viewpoint of information 
geometry induced associated with Bregman divergences. Fletcher has 
studied a variational problem which derives the approximate Hessian 
update formula of the quasi-Newton methods. We point out that the 
variational problem is identical to optimization of the Kullback-Lciblcr 
divergence, which is a discrepancy measure between two probability 
distributions. The Kullback-Lcibler divergence for the multinomial 
normal distribution corresponds to the objective function Fletcher has 
considered. Wc introduce the Bregman divergence as an extension 
of the Kullback-Leibler divergence, and derive extended quasi-Newton 
update formulae based on the variational problem with the Bregman 
divergence. As well as the Kullback-Leibler divergence, the Bregman 
divergence introduces the information geometrical structure on the set 
of positive definite matrices. From the geometrical viewpoint, we study 
the approximation Hessian update, the invariance property of the up- 
date formulae, and the sparse quasi-Newton methods. Especially, we 
point out that the sparse quasi-Newton method is closely related to sta- 
tistical methods such as the EM-algorithm and the boosting algorithm. 
Information geometry is useful tool not only to better understand the 
quasi-Newton methods but also to design new update formulae. 

1 Introduction 

The main purpose of this article is to study the quasi-Newton methods from 
the view point of dualistic geometry or in other word information geometry 
[2j [26] [22] . Let us consider the unconstrained optimization problem 
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in which the function / : R n — >• R is twice continuously differentiable on 
R n . The quasi-Newton method is known to be one of the most successful 
methods for unconstrained function optimization. In quasi-Newton method 
a sequence {xk}^L C R n is successively generated in a manner such that 
x k+ i = x k — a k B^ l V f(x k ), where a k is a step length computed by a line 
search technique. The matrix B k is a positive definite matrix which is ex- 
pected to approximate the Hessian matrix V 2 f(x k ). The matrix B k and the 
step length a k are designed such that the sequence x k converges to a local 
minima of the problem (pQ). For the step length, the Wolfe condition |23|. 
Section 3.1] is a standard criterion to determine the value of a k . In terms of 
the approximate Hessian matrix, mainly there are two methods of updating 
B k to -Bfc+ii ° ne is called the DFP formula and the other is called the BFGS 
formula. 

We introduce the DFP and the BFGS methods. Let s k and y k be column 
vectors defined by 

s k = x k+1 -x k = -a k B^ l V }{x k ), Vk = Vf(x k+ i) - Vf(x k ), 

and suppose that sjy k > holds. In the DFP formula the approximate 
Hessian matrix B k is updated such that 

R R DFP \R- e 11 ] ■ R B kSkyl + VksjBk T y k yl y k yj 

B k+ \ — B [B k , s k ,y k \ .— B k = h s k B k s k —^ — — + — . 

s k Vk {s k ykV s k Vk 

(2) 

In the BFGS update formula, the matrix B k +i is defined by 

r _ R BFGSr R . a „, i d B k s k sjB k y k yj 

Bk+i — B [B k , s k , yk\ ■— E>k tt: V — , [o) 

s k B k s k s k [ y k 

Under the condition that B k € PD(n) and s k y k > 0, the matrices B DFP [B k ; s k , y k ] 
and B BFGS [B k ; s k , y k ] are also positive definite matrices. If there is no con- 
fusion, the update formulae B DFP [B; s, y] and B BFGS [B; s, y] are written as 
B DFP [B] and B BFGS [B] , respectively. In practice, the Cholesky decompo- 
sition of B k is successively updated in order to compute the search direction 
—B^ 1 Vf(x k ) efficiently [H]. Note that the equality 

B DFP [B;s,y]^ = B BFGS [B^;y,s] 

holds. Hence, we can derive the update formulae for the inverse H k = B^ 1 
without inversion of matrix. 
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Both the DFP and the BFGS methods are derived from variational prob- 
lems over the set of positive definite matrices [10]. Let PD(ra) be the set of all 
n by n symmetric positive definite matrices, and the function tjj : PD(ra) — > R 
be a strictly convex function over PD(n) defined by 

ip(A) = tr(A) - log det A. 

Fletcher |10| has shown that the DFP update formula ((2J) is obtained as the 
unique solution of the constraint optimization problem, 

1 /2 1 1 /2 

min ip{Bj B~ Bj ) subject to Bs k = y k , 

SGPD(n) K K 

where A 1 ! 2 for A G PD(n) is the matrix satisfying A 1 / 2 € PD(n) and 
(A 1 / 2 ) 2 = A. The BFGS formula is also obtained as the optimal solution of 

min ip(B k BB, ) subject to Bs k = yk, 

BSPD(n) K K 

in which B k denotes {B^ 1 ) 1 / 2 or equivalently (flV 2 ) - 1 . 

It will be worthwhile to point out that the function is identical to 
Kullback-Leibler(KL) divergence [21 [19] up to an additive constant. For 
P, Q € PD(n), the KL-divergence is defined by 

KL(P,Q) =tv(PQ~ 1 ) -logdet(PQ~ 1 ) -n 

which is equal to ^(Q -1 / 2 PQ~ l l 2 ) — n. The KL-divergence is regarded 
as a generalization of squared distance. Using the KL-divergence, we can 
represent the update formulae as the optimal solutions of the following min- 
imization problems, 

(DFP) min KL( B k ,B) subject to Bs k = y k , (4) 

BGPD(n) 

(BFGS) min KL(B, B k ) subject to Bs k = y k . (5) 

BePD(n) 

The KL-divergence is asymmetric, that is, KL(P, Q) ^ KL(Q, P) in general. 
Hence the above problems will provide different solutions. 

In the information geometry [2] , the KL-divergence defines a geometrical 
structure over the space of probability densities. Statistical inference such 
that the maximum likelihood estimator is better understood based on the 
geometrical intuition. Originally, the KL-divergence is defined as the dis- 
crepancy measure between two multinomial normal distributions with mean 
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zero. In this paper, we show that the information geometrical approach 
is useful to understand the behaviour of quasi-Newton methods. On the 
set of positive definite matrices, PD(ra), we define the so-called Bregman 
divergence which is an extension of the KL-divergence. The Bregman di- 
vergence induces a dualistic geometrical structure on PD(n). Then we can 
derive new Hessian update formulae based on the Bregman divergence. We 
present a geometrical view of quasi-Newton updates, and discuss the rela- 
tion between the Hessian update formula and the statistical inference based 
on the information geometry. 

Here is the brief outline of the article. In Section [2j we introduce the 
elements of information geometry based on the Bregman divergence, espe- 
cially over the set of positive definite matrices. In Section [3l an extended 
quasi-Newton formula is derived from the Bregman divergence. Section [4] 
is devoted to discuss the invariance property of the quasi-Newton update 
formula under the group action. In Section [U we discuss the sparse quasi- 
Newton methods [32] from the viewpoint of the information geometry, and 
point out that the sparse quasi-Newton method is closely related to sta- 
tistical methods such as the EM-algorithm [20] or the boosting algorithm 
|12[ [22] . We conclude with a discussion and outlook in Section [6l Some 
proofs of the theorems are postponed to Appendix. 

Throughout the paper, we use the following notations: The set of positive 
real numbers are denoted as M+ C M.. Let det^4 be the determinant of 
square matrix A, and GL(n) denotes the set of n by n non-degenerate real 
matrices. SL(n) C GL(n) is the set of n by n non-degenerate real matrices 
with determinant 1, that is, SL(ra) = {A S GL(n) | det^4 = 1}. The 
set of all n by n real symmetric matrices is denoted as Sym(n), and let 
PD(n) C GL(n) n Sym(n) be the set of n by n symmetric positive definite 
matrices. For P £ PD(n), the square root of P is denoted as P 1 / 2 which is 
defined as P For a vector x, \\x\\ denotes the Euclidean norm. For two square 
matrices A, B, the inner product (A, B) is defined by tv(AB T ), and ||^4||f 
is the Frobenius norm defined by the square root of (A, A). Throughout the 
paper we only deal with the inner product of symmetric matrices, and the 
transposition in the trace can be dropped. 

2 Bregman Divergences and Dualistic Geometry 
of Positive Definite Matrices 

We introduce Bregman divergences which are regarded as an extension of 
the KL-divergence. Then we illustrate a differential geometrical structure 



4 



defined from the Bregman divergence over the set of positive definite ma- 
trices. In sequel sections, we will provide a geometrical interpretation of 
quasi-Newton methods. For general Bregman divergences, however, the 
quasi-Newton update formula cannot be obtained in the explicit form. In 
order to obtain computationally tractable update formulae, we often use 
a specific Bregman divergence which is called the ^-Bregman divergence 
in this article. First, we define general Bregman divergences, and then we 
introduce the V-Bregman divergence as a special case of general Bregman 
divergences. We will show the associated geometrical structure on the set 
of positive definite matrices. 

2.1 Bregman divergences 

The Bregman divergence [7] is defined through the so-called potential func- 
tion. Below, we define the Bregman divergence over the set of positive 
definite matrices. 

Definition 1 (Potential function and Bregman divergence). Let ip : PD(ra) — > 
K be a continuously differentiable, strictly convex function that maps positive 
definite matrices to real numbers. The function <p is referred to as potential 
function or potential for short. Given a potential (p, the Bregman divergence 
D ip (P,Q) is defined as 



for P,Q G PD(n), where Vp(Q) is the n by n matrix whose element 
is given as ^p(Q)- 

The Bregman divergence D lfi (P, Q) is non- negative and equals zero if and 
only if P = Q holds. Indeed, due to the strict convexity of <p, the function 
p(P) lies above its tangents (p(Q) + (Vp(Q),P — Q) at Q. Hence, the 
non-negativity of the Bregman divergence D ip (P,Q) is guaranteed. Note 
that D ip (P,Q) is convex in P but not necessarily convex in Q. Bregman 
divergences have been well studied in the fields of statistics and machine 
learning [3 El [22]. 

Example 1. For P 6 PD(n) let the function tp be <p{P) = — logdet(P). 
Note that ^p{P) is a strictly convex function. Then, we have 



D V (P,Q) = <p(P) - <p(Q) - (V<p(Q),P 



Q) 



(6) 



(y<p(Q))a 







log det Q 



dQij 
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Hence the corresponding Bregman divergence is 

D V (P,Q) = - log detP + logdet Q + (Q-\P-Q) = (P, Q' 1 ) - logdet(PQ- 1 ; 

is identical to the KL- divergence on the multivariate normal distribution 
with mean zero M 



n, 



By replacing the KL-divergence in or ([5]) with a Bregman divergence, 
we will obtain another variational problem for the quasi-Newton method. 
In general, however, update formula cannot be explicitly obtained. Below 
we define a class of Bregman divergences called V-Bregman divergence. In 
Section [3l we show that the V-Bregman divergence provides an explicit 
update formula of the quasi-Newton method. 

We prepare some ingredients to define the V-Bregman divergence. Let 
V : K+ — > R be a strictly convex, decreasing, and third order continuously 
differentiable function. For the derivative V', the inequality V < holds 
from the condition. Indeed, the condition leads to V < and V" > 0, 
and if V'(zo) = holds for some zq € R+, then V'(z) = holds for all 
z > zq. Hence V(z) is affine function for z > zq. This contradicts the strict 
convexity of V . We define the functions vy : — > R and /3y : M+ — > R 
such that 

u v (z) = -zV'(z), (3 v (z) = = z ■ /logzv(z) 

Uy\Z) uz 

Since vy{z) > holds for z > 0, the function /3y is well defined on R + . The 
subscript V of vy and /3y will be dropped if there is no confusion. We now 
are ready to present the definition of V-Bregman divergence over PD(n). 

Definition 2 (V-Bregman divergence). Let V : R+ — > R be a function which 
is strictly convex, decreasing, and third order continuously differentiable. 
Suppose that the functions v and f3 defined from V satisfy the following 
conditions: 

P(z) < - {z > 0) (7) 
n 

and 

lim - f = = 0. (8) 

T/ie Bregman divergence defined from the potential <p(P) = V(det P) is 
called V-Bregman divergence, and denoted as Dy(P,Q). Not only V(detP) 
but also V{z) is also referred to as potential. 
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As shown in [26], the function V(detP) is strictly convex in P € PD(n) 
if and only if the potential V satisfies (J7J). The ^-Bregman divergence has 
the form of 

D V (P,Q) = V(detP)-V(detQ) + v(detQ)(Q- 1 ,P) -w(detQ). (9) 
Indeed, substituting 

(v * (Q))y = 9V dQl Q) = v ' {detQ) i^r = -"^w 1 ^ 

into ©, we obtain the expression of Dy{P, Q). The KL-divergence KL(P, Q) 
is represented as Dy(P,Q) with the potential V(z) = — logz. Below we 
show some examples of l^-Bregman divergence. 

Example 2. For the power potential V{z) = (1 — z 7 )/7 with 7 < 1/n, we 
have v{z) = z 7 and j3(z) = 7. Then, we obtain 

D V (P, Q) = (det Q) 7 { (P, Q~ 1 ) + 1 ~ (detPQ- 1 ) 7 _ ^ 

27ie KL-divergence is recovered by taking the limit of j — > 0. 

Example 3. For < c < 1, let us define V{z) = clog(cz + 1) — log(z). 
Then V(z) is a strictly convex and decreasing function, and we obtain 

2 

u{z) = 1 - c + — °— > 0, /3(z) = M ~°. Z S — n ^ 
cz + 1 (cz + l)(c(l — c)z + 1) 

for z > 0. XTie negative-log potential, V(z) = — logz, is recovered by setting 
c = 0. T/ie potential satisfies the bounding condition < 1 — c < v(z) < 1. 
As shown in the sequel the bounding condition of v will be assumed to 
prove the convergence property of the quasi-Newton method. 



2.2 Dualistic Geometry defined from Bregman Divergences 

The space of positive definite matrices has rich geometrical and algebraic 
structures [26] Here we introduce dualistic geometrical structure on PD(n) 
induced form the Bregman divergence. See \22\ [25] for details. 

We introduce two coordinate systems on PD(ra). The 77-coordinate sys- 
tem n : PD(n) — > PD(n) is defined as 

V(P) = P, 
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which is the identity function on PD(ra). The definition of the other coordi- 
nate system requires the potential <p for the Bregman divergence D lfi (P,Q) 
in ([6]). Let us define the ^-coordinate system as 

e v (P) = v<p(P) 

Note that the matrix 9 V (P) is not necessarily a positive definite matrix. 
Indeed, for the potential ip(P) = — logdetP, we have 0<p{P) = — P~ l which 
is a negative definite matrix. The function 9 V is, however, one-to-one map- 
ping. Hence 9 V (P) works as the coordinate system on PD(n). The inverse 
function of V</? is expressed by the conjugate function of <p. The convex 
function <p has the dual representation called Fenchel conjugate, which is 
defined as 

<p*(P)= sup {(P,Q)-<p(Q)}. (10) 

Q€PD(n) 

Then, we have 

v^(p) = (v^r i (p) = (^)- i (p) 

on the domain of ip* \30\ Theorem 26.5]. For any potential <p, the ij- 
coordinate system is common and only the ^-coordinate system depends 
on the potential. 

For the potential V of the ^-Bregman divergence, the ^-coordinate 
system is denoted as 9y{P), which is given as 

e v (P) = -v{P)p-\ 

Thus 9y(P) is a negative definite matrix for P £ PD(n). 

Let us define the flatness of a submanifold in PD(n). See [2] for the 
formal definition of the flatness with terminologies of differential geometry. 

Definition 3 (autoparallel submanifold). Let A4 be a subset o/PD(n). If 
M. is represented as an affine subspace in the rj- coordinate, then M. is called 
rj- autoparallel submanifold. If A4 is represented as an affine subspace in 
the 9 u>- coordinate, then A4 is called 9 autoparallel submanifold. When an 
rj- autoparallel submanifold M. is also 9 ^-autoparallel, A4 is called doubly 
autoparallel submanifold. 

For the potential <p{P) = V(det P), the ^-coordinate and the ^-autoparallel 
is denoted as the #y-coordinate and the ^-autoparallel, respectively. For- 
mally, the flatness is defined from the connection on the differentiable man- 
ifold [21 [18]. Here, we adopt a simplified definition. 
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Example 4. Let V(z) be the negative logarithmic function V(z) = — log(z), 
then we have v{z) = 1. The r/- coordinate system is defined as r](P) = P, 
and the Qy -coordinate system is given as Oy(P) = — P . For two vectors 
s, y G M n we define the submanifold M which represents the secant condition 
such that 

M = {B G PD(n) | Bs = y}. 

Suppose Ai 7^ 0, then we see that Ai is doubly autoparallel, since 

M = {B G PD(n) | r]{B)s = y} = {B G PD(n) | ^y(S)y = -s} 

holds. That is, Ai is represented as the ajfine subspace in both the ij- 
coordinate system and the By -coordinate system. 

2.3 Extended Pythagorean Theorem 

The projection of a matrix in PD(n) onto an autoparallel submanifold is 
defined below. Then, we introduce the extended Pythagorean theorem. 

Definition 4 (projection). Let if be a potential, Q be a positive definite 
matrix. An rj- autoparallel submanifold in PD(n) is denoted as A4. The 
matrix P* G A4 is called 9 „ -projection of Q onto M, when the equality 

(0<p(Q) - o v (P*), v(P) - v(p*)) = o, V G M 

holds. Let M be a 6 ^- autoparallel submanifold in PD(n). The matrix P* G 
A/" is called n-projection of Q onto N when the equality 

(r?(Q) - n {p% e v {p) - e v (p*)) = o, v p g M 

holds. 

Let £ be a one-dimensional ^-autoparallel submanifold defined as 

£ = {PGPD(n)| 3 t GR, 9 v (P) = (l-t)0 v (Q)+t6 v (P*)}. 

When P* is the ^-projection of Q onto A4, the 77-autoparallel submanifold 
M. is orthogonal to C at P* with respect to the inner product (•,•). In the 
r\- projection, also the same picture holds by replacing n and 9 V . 

Theorem 1 (Extended Pythagorean Theorem [21 [22]). Let (p be a potential 
function, Ai be an n- autoparallel submanifold in PD(n), and Q be a positive 
definite matrix. Then, the following three statements are equivalent. 
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(a) P* is a Op-projection of Q onto Ai. 

(b) P* G Ai satisfies the equality 

D tp (P,Q)=D ip (P,P*) + D (p (P*,Q) (11) 

for any P € Ai. 

(c) P* is the unique optimal solution of the problem 

min DJP,Q) subject to P € Ai. (12) 

P6PD(n) ^ 

Proof. For any P,P*,Q £ PD(n) the equality 

D^P, Q) - ^(P, P*) - D V (P*,Q) = (9 V (Q) - e v (P*), n{P*) - V (P)) 

(13) 

holds. The equivalence between (a) and (b) follows the above equality. If 

(b) holds, then the non-negativity of the divergence assures that P* is an 
optimal solution of f)12|) . The uniqueness follows the strict convexity of the 
divergence D V (P, Q) in P. Hence (c) holds. Finally, we show that (a) follows 

(c) . Let P* be an optimal solution of (fT2]) . The 77-autoparallel submanifold 
A4 is represented by 

M = {P£PB(n) I (r ] (P),A l )=b l ,i=l,...,k} 

in which Ai is an n by n real matrix and hi € M. for i = 1, . . . , k. The 
optimality condition of ()12j) yields that 

k 

-0<p(F*) + V (Q) = ^2\iAi, Aj G K. 

i=l 

with some Ai, . . . , In addition, the fact that both P and P* are included 
in A4 leads to the equalities 

{ V (P*)- V (P),Ai} = 0, i = l,...,k. 

Therefore, we obtain 

(9 tp (Q)-e (p (P*),r ] (P*)- V (P))=0 

for any P € A4. This implies that P* is a ^^-projection of Q onto □ 
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The uniqueness of the ^-projection onto the 7/-autoparallel submanifold 
is shown through the equivalence between (a) and (b) in Theorem [TJ The 
similar argument is valid for r/-projection onto ^-autoparallel submanifold. 
We show the result without proof. 

Theorem 2. Let if be a potential function, M be a O^-autoparallel subman- 
ifold in PD(n), and Q be a positive definite matrix. Then, the following 
conditions (a) and (b) are equivalent. 

(a) P* is an n-projection of Q onto AT. 

(b) P* € TV satisfies the equality 

D ip {Q,P) = D v {Q,P*) + D ip {P*,P) (14) 

for any P £ M . 

When (a) or (b) holds, P* is the unique optimal solution of the problem 

min DJQ,P) subject to P E M. (15) 

PgPD(n) 

The Bregman divergence D^(Q, P) may not be convex in P, and hence 
the conditions (a) or (b) in Theorem ([2]) is not necessarily derived from the 
optimality condition of (fl~5|h 

As shown in Section [IJ the BFGS/DFP update formulae are derived 
by minimizing the KL-divergence. Example H] shows that the submanifold 
associated with the secant condition A4 = {B € PD(n) | Bs^ = y^} is 
doubly autoparallel with respect to the flatness defined from the potential 
V(z) = — logz. Thus, we obtain the following geometrical interpretation, 

BFGS update: #y-projection of Bk onto the ry-autoparallel submanifold 
M, 

DFP update: ^-projection of B^ onto the #y-autoparallel submanifold M.. 

FigureQ]presents the geometrical view of the standard quasi-Newton updates 
based on information geometry. 

3 quasi-Newton Methods based on Bregman Di- 
vergences 

We consider quasi-Newton update formulae derived from variational prob- 
lems with respect to Bregman divergences. As shown in Section [lj the 
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Secant condition: doubly-autoparallel 

Figure 1: Geometrical interpretation of quasi-Newton updates. For the 
potential V(z) = — log z, the submanifold M. defined by the secant condition 
is doubly autoparallel with respect to r]- and #y-coordinate systems. The 
BFGS formula B BFGS [B k ] is given as the #y-projection of B k onto the 77- 
autoparallel submanifold A4, and the DFP update -B DFP [.Bfc] is given as the 
^-projection of B k onto the #y-autoparallel submanifold M.. 

standard quasi-Newton updates are derived from the minimization problem 
of the KL-divergence. We show that Bregman divergences lead extended up- 
date formulae. In addition, an explicit expression of the extended Hessian 
update formula is presented. 

We consider the minimization problem of the Bregman divergence in- 
stead of the KL-divergence. The extended BFGS update formula is given 
as the optimal solution of 

min DJB,B k ), subject to Bs k = y k . (16) 

SSPD(n) 

Suppose that the optimal solution B k+ \ exists. Then B k+ \ is the unique 
^-projection of B k onto the submanifold defined from the secant condition. 
On the other hand, as the extension of the DFP update, we consider the 
problem, 

min D ip (B~ 1 , B^ 1 ), subject to Bsk = yk- (17) 

-BePD(n) 

instead of the minimization of KL(i?fc, B) = KL(i? _1 , B^ 1 ). In the similar 
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way, we can derive the quasi-Newton methods for the approximate inverse 
Hessian matrix H k = B^ 1 . 

In the following we focus on the extension of the BFGS method (|16p . 
since the same argument is valid for the extension of DFP method. A formal 
expression of the optimal solution is presented in the theorem below. 

Theorem 3. Suppose that there exists an optimal solution (|16p . Then the 
optimal solution B k+ \ is unique and satisfies 

Bk+i = Vip*(Vip(B k ) + s k X T + XsJ), B k+1 s k = y k , 

where A G W 1 is a column vector and ip* is the Fenchel conjugate function 
off- 
Proof. Since (|16p is a convex problem and the objective function D Lp {B, B k ) 
is strictly convex in B, we see that the optimal solution is unique if it exists. 
Suppose that B k+ \ is the optimal solution of (fT6|) . then B k+ \ satisfies the 
optimality condition. According to Giiler, et al. |16| . the normal vector of 
the affine subspace A4 = {B G PD(n) | Bs k = y k } is characterized by the 
form of 

s k X T + XsJ G Sym(n), A G R n . 

In fact for B\,B 2 G M we have 

(s k X T + XsJ, B x - B 2 ) = X T B lSk + sjBiX - X T B 2 s k - sj B 2 X 

= A T y fc + yJX - X T y k - yJX 
= 0, 

and thus s k X T + XsJ is a normal vector of A4. Giiler, et al. [16] have shown 
that the normal vector is restricted to the expression above. Hence, for the 
optimal solution B k+ i there exists A G M n such that VD V (B, Bk)\ B=B = 

s fcA T + XsJ and B k s k = y k hold. The first equality is represented as 
\7(p(B k+ i) — V(f(B k ) = s k X T + XsJ . The existence of B k+ i assures that 
B k+ i = V ip* (y ip{B k ) + s k X T + XsJ), where ip* is the Fenchel conjugate of 
<p defined in ([10]). □ 

For general Bregman divergences, we do not have the explicit expression 
of the Hessian update formula. As a special case, we consider the minimiza- 
tion problem of the ^-Bregman divergence, 

V-BFGS: min D v (B,B k ), subject to Bs k = y k . (18) 

BGPD(n) 
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The update formula obtained by the problem above is referred to as the 
F-BFGS update formula. The theorem below shows an explicit expression 
of the F-BFGS update formula. 

Theorem 4 (y-BFGS update formula). Suppose the function V is a poten- 
tial function defined in Definition^ Let B k € PD(n), and suppose sjy k > 0. 
Then the problem (|18|) has the unique optimal solution B k+ \ € PD(ra) sat- 
isfying 

k+1 v(detB k ) \ u(det B k ) J s Jy k 1 ' 

Though the theorem is proved in [T7] , the proof is also found in Appendix 
|A]of the present paper as a supplementary. In the same way, we can obtain 
the explicit formula of the I/-DFP update formula, which is the minimizer 
of Dy(B~ l ,-B^T 1 ) subject to Bs k = y k . The update formula is equivalent to 
the self-scaling quasi-Newton update defined as 



B k+1 = e k B BFGS [B k - s k , y k ] + (1 - fl fc )^S, (20) 



where k is a positive real number. Various choices for 9 k have been pro- 
posed, see [291 A popular choice is 6 k = sjy k /sj B k s k . In the F-BFGS 
update formula, the coefficient 6 k is determined from the function v. 

We present a practical way of computing the Hessian approximation 
(|19p . Details are shown in the sequel [T7]- In Eq (|19p . the optimal solution 
B k+ \ appears in both sides, that is, we have only the implicit expression 
of B k+ \. The numerical computation is, however, efficiently conducted as 
well as the standard BFGS update. To compute the matrix B k+ \, first we 
compute the determinant detB k+ i. The determinant of both sides of (fl"9j) 
leads to 

detB t+1 = S f^*>rf . ,(det Bfc+1 )-'. (21) 

+ v(detB k ) n ~ l y + J y ' 

Hence, by solving the nonlinear equation 
det(B BFGS [B k ;s k ,y k \) 



v(det B 



in— 1 



v(z) n ~\ z>0 



we can find detB k+ i. As shown in the proof of Theorem the function 
z/v(z) n ~ 1 is monotone increasing. Hence the Newton method is available 
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to find the root of the above equation efficiently. Once we obtain the value 
of det -Bfc+i, we can compute the Hessian approximation B k+ i by substi- 
tuting det£>fc + i into Eq (fT9j) . Figure [2] shows the update algorithm of the 
y-BFGS formula which exploits the Cholesky decomposition of the approx- 
imate Hessian matrix. By maintaining the Cholesky decomposition, we can 
easily compute the the determinant and the search direction. The con- 
vergence property of the quasi-Newton method with the V^-BFGS update 
formula is considered in |17| . 

Example 5. We show the V-BFGS formula derived from the power poten- 
tial. Let V(z) be the power potential V(z) = (1 — z" 1 )/^ with 7 < 1/n. As 
shown in Example^ we have v(z) = z 7 . Due to the equality 

det(B BFGS [B k ;s k ,y k ]) = det(B k )^^- 

s k B k s k 

and Eq. f)21 1) . we have 

v(detB k+1 ) _ ( sjy k \ p _ 7 

v{detB k ) ~\sjB k s k ) ' p -l-(n-l) 7 ' 

Then the V-BFGS update formula is given as 

\s k B k s k J \ \s k 'B k s k J J s k 'y k 

For 7 such that 7 < 1/n, we have — l/(n — 1) < p < 1. In the standard 
self-scaling update formula (|20j) . the above matrix B k+ \ with p = 1 is used, 
while it is not derived from the strictly convex potential function. 



4 Invariance of Update Formulae under Group Ac- 
tion 

In this section we study the invariance of the V^-BFGS update formula (|19j) 
under the affine coordinate transformation of the optimization variable. For 
the minimization problem of the function f(x), let us consider the variable 
change of x. For a non-degenerate matrix T £ GL(n), the variable change 
is defined by 

x = T _1 x, (22) 



15 



F-BFGS update: 

Initialization: The function u(z) denotes —V'{z)z. Let Bq £ 
PD(n) be a matrix which is an initial approximation of the 
Hessian matrix, and LqLq = Bq be the Cholesky decomposi- 
tion of Bq. Let xo € R n be an initial point, and set k = 0. 

Repeat: If stopping criterion is satisfied, go to Output. 

1. Let Xfc+i = x k — a k B7 l V f (xk) , where a k > is a step 
length satisfying the Wolfe condition [231 Section 3.1]. 
The Cholesky decomposition B k = L k L^ is available to 
compute B^ l V f{x k ). 

2. Set s k = x k+l - x k and y k = Vf{x k+X ) - V/(x fc ). 

3. Update L k to L which is the Cholesky decomposition of 
B BFGS [B k ;s k ,y k ], that is, 

IV = B BFGS [B k ;s k ,y k ] = B BFGS [L k L T k ; s k ,y k ). 

The Cholesky decomposition with rank-one update is 
available. 

4. Compute 

(detZ) 2 



C 



K(detL fc )2)«-i 
and find the root of the equation 

C ■ v{z) n ~ x = z, z>0. 

Let the solution be z* . 

5. Compute the Cholesky decomposition L k+ i such that 

fc+1 fc+1 ~-((detL fc ) 2 ) LL + V v{{tetL k y))4y- k 

6. k<-k + l. 

Output: Local optimal solution x k . 



Figure 2: Pseudo code of F-BFGS method. The Cholesky decomposition 
with rank-one update is useful in the algorithm. 
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then the function f(x) is transformed to f(x) defined as 



f(T x). 



Then we have 



V/(x) = (T T )- 1 V/(T~ 1 5?), V 2 /(x) = {T T y\V 2 f(T- l x))T- 1 . 



Our concern is how the point sequence {x k } k L 1 generated by the F-BFGS 
method is transformed by the variable change (|22p . 

We consider the Hessian approximation matrix under the variable change. 
Let B k € PD(n) be the Hessian approximation computed at the k-th. step 
of the y-BFGS update for the minimization of f(x). We now define 



Let .Bfc+i be the Hessian approximation matrix updated from B k for the 
function fix), where we suppose that the V-BFGS method is used for the 
minimization of f(x). We consider the relation between B k+ \ and B k+ \. 
The updated point x k+ \ is determined by 



where a k is a non-negative real number determined by a line search. Then 



f(x k - a k B- l Vf{x k )) = f(T(x k - a k B- l Vf{T' l x k ))) = f(x k - a k B^V f(x k )). 

(23) 



Let a k be the step length for the function f(x) at the k-th. step of the V- 
BFGS method. Due to the equality ([231) . we see that the step length a k is 
identical to a k , if the line search with the same stopping rule is applied for 
both f(x) and f(x). As the result, the equality x k+ \ = Tx k+ \ holds under 
the condition a k = a k . Let ~s k and y k be 



x k Tx k 



B k = (T ) B k T 



x k+ i = x k - a k B k Vf(x k ) 



we have 



Sfe Xfc+l x k , 



Vk = Vf(x k+1 ) - V/(x fc ) 



then we obtain the equalities, 



Sk = Ts k 



y k = (T T )- 1 y fc . 



We consider the condition of T such that the equality 
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ICQ = Txq 



_1 V/(x ) 



Figure 3: The coordinate transformation between x of the function / and x 
of the function / is depicted. The initial point xq is transformed to xq = Txq 
and the search direction at xq is also transformed to — Bq 1 X7 /(xq). The 
quasi- Newton method is applied to both f(x) and fix), and then the points 
Xk and Xk are obtained in each coordinate system. If the equality T~ 1 Xk = 
Xk holds, the optimization algorithm is invariant under the transformation 
with T. 

holds, when Xk = Txk and Bk = (T T )~ 1 Bi c T~ 1 are satisfied. For such T, 
the equality Xk+i = Txk+i recursively holds. This implies that the point 
sequence obtained by the F-BFGS method is invariant under the affine 
transformation (|22|) . In the optimization of f(x) by the T^-BFGS method, 
the matrix Bk is updated to B^+i such that 



u{detB k+1 ) bfgSiv ,~ ~i, A KdetBfc+i) \ Vml_ 

-Dfc-1-1 = ~ B [Bk, Sk, Dk\ + 1 ; ~ T ~ ■ 

V i/(detS fe ) / s k 'y k 



i/(det Bjt) 
Some calculation yields that 



T T B k+1 T 



^ B ^) B BFGS [Bk . Sfe) yfc] + A _ KdetBfc +1 )\ 
^(detSfc) V i/(det5 fc ) / s^yk 



(24) 



The following theorem provides a sufficient condition on T such that T T Bk+\T 
Bk+i holds. 
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Theorem 5. Suppose thatT € SL(n), that is, det(T) = 1. Then the equality 
T T B k+ iT = .Bfc+i holds for any V-BFGS update formula. 

Proof. Due to the assumption det(T) = 1, we have det(B k ) = det(B k ). 
Then Eg. (j 24p is equivalent with 



T-B k+1 T = ^ff*^ B BFGS [B k ; s k ,y k ] + ( 1 Kdet£ fc+1 )W 



y(detSfe) " V ^(det5 fc ) / s Jy k 

Hence, the determinant of T T B k+ \T yields the equality 

det(B fc+ i) _ det (B BFGS [B k ; s k ,y k }) 



n-l 



^(detBfc+i)™- 1 v(detB k ) 

where det(T T .Bfc_|_iT) = deti^+i is used. On the other hand, the matrix 
B k+ \ defined by the F-BFGS update formula (fT9j) also satisfies, 

det(B k+1 ) _ det (B BFGS [B k ; s k ,y k }) 



n— 1 



^(detSfc+O™- 1 u(detB k ) 

As shown in the proof of Theorem HI the function z/v(z) n ~ l is one to one 
mapping, and thus we have detB k+ i = det.Bfc +1 . Therefore, the equality 
T T B k+l T = B k+1 holds. □ 

Next, we study the variable change with T £ GL(n). Below we assume 
= 1 without loss of generality. Let us define 

b k = det B k , b k+1 = det B k+ i , b k+1 = det B k+1 , t = det T 

and 

det B BFGS [B k ;s k ,y k ] 
a v{detB k ) n ~ l 

In the update formula, the determinant of B k+ i leads the equality 

b k+1 = a-u(b k+1 ) n - 1 . (25) 

The matrix B k+ \ satisfies the update formula (|24p . thus the determinant of 
both sides yields the equality 

(~ \ n—l 

v{b k+l )v{b k ) \ 
v(b k t-i) ■ (26) 
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When T T B k+ \T = B k+ \ holds, Eg. ([26]) is represented as 

' u(b k+1 t- 2 )u(b k )^ r 



■>k+i 



v(b k t~ 



(27) 



We consider the function v which satisfies (|25p and (|27p simultaneously. 
For a positive number a > 0, let b a be the unique solution of the equation 
of b, 

b = a-v(b) n -\ b>0, 

and E v = {b a £ R | a > 0} be the set of all possible solutions of the above 
equation. Note that 1 € E v holds for any v since 1 = 1- v{l) n ~ l holds. 

Theorem 6. Let v{z) > be a differentiable function on R + . Suppose 
that there exists an open subset E C R satisfying 1 E E C E„. For the 
Hessian approximation by the V-BFGS method, suppose that the equality 
B k+1 = {T T )' 1 B k+l T- 1 holds for all T € GL(n), all B k € PD(ra) and all 
s k ,y k € R n satisfying sjy k > 0. Then the function v is equal to v{z) = z 1 
with some 7 £ R. 

Note that E v = R + holds for v{z) = z 7 unless 7 = l/(n — 1). 

Proof. Under the assumption, the equations (|25[) and (j27|) share the same 
solution b k+ \ for any a > 0, b k > and f 7^ 0. Let = 1, x = t~ 2 > 0. For 
any positive a and x, equations (|25p and (|27p lead to 



6 n = a ■ i/(6a) n_1 and 6 a = a ■ ^x)"- 1 ( 44V ' = « 



z^(x) / u ( x ) 



n-1 



for 6 a € E u . Hence we obtain 



v{b a x) = v(b a )u(x), a > 0, x > z/(6x) = v(b)v(x), b £ £^,, x > 0. 

(28) 

The assumption on guarantees that 1 + e € E v holds for any infinitesimal 
e. Thus Eq. ([28|) leads the following expression, 

v(x(l + e)) - u{x) _ u{x) + e) - 
xe x e 

Taking the limit e — > 0, we obtain the differential equation, 

*/(x) = z/(l)^, 1/(1) = 1, 

X 

and the solution is given as z/(x) = x u '^\ □ 
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As shown in Example[2j the function v(z) = z 1 is derived from the power 
potential V{z) = (1 — z 1 )/^. In robust statistics, the power potential has 
been applied in wide-rage of data analysis [HIM]. 

Remark 1. Ohara and Eguchi [26] have studied the differential geomet- 
rical structure over PD(n) induced by the V -Bregman divergence. They 
pointed out that the geometrical structure is invariant under SL(n) group 
action. Furthermore, they have showed that for the power potential V(z) = 
(1 — z 1 )/^, the By- (rj-) projection onto rj- (By-) autoparallel submanifold is 
invariant under GL(n) group action. It turns out that only the orthogonality 
is kept unchanged under the group action. The other geometrical features 
such as angle between two tangent vectors are not preserved in general. The- 
orem [6] indicates that the invariance of the geometrical structure on PD(n) 
is inherited to the invariance of point sequences of quasi-Newton methods 
under the affine transformation. 

In summary, we obtain the following results. Suppose that xq = Txq, Bq = 
(T t )~ 1 BqT~ 1 holds. Let {x^} and {x^} be point sequences generated by 
the y-BFGS method for the functions f(x) and f(x), respectively. Suppose 
that the line search with the same stopping rule is used for the step length. 
Then, for any T G SL(n) the equality Xk = Txk holds for all k > 1. More- 
over the equality x^ = Tx^, k > 1 holds for any T G GL(ro) if and only if 
the function V(z) is the power potential. 

5 Geometry of Sparse quasi-Newton updates 

Sparse quasi-Newton method exploits the sparsity of Hessian matrix in order 
to reduce the computation cost [32]. The sparsity pattern of the Hessian 
matrix at a point x G R n is represented by an index set F satisfying 

| (V 2 f(x))ij / 0}cF. 

When the number of entries in F is small, the matrix V 2 f{x) is referred to 
as sparse matrix. We assume that G F holds for (i,j) G F and that 
(i, i) G F for all i = 1, . . . , n. Given a sparsity pattern F, the set of sparse 
matrix is defined by 

S = {P G PD(n) | Pij = for F}. 

Clearly the submanifold S is 77-autoparallel in PD(n). 

Yamashita [32] has proposed a sparse quasi-Newton method. In this sec- 
tion we show an extension of sparse quasi-Newton method and illustrate a 



21 



geometrical structure of the update formula. First, we briefly introduce the 
sparse quasi-Newton method proposed by Yamashita [32] , Suppose be 
an approximate inverse Hessian matrix at the fe-th step of the sparse quasi- 
Newton method. Let be the updated matrix of by the existing 
quasi-Newton methods such as the BFGS or the DFP method for the ap- 
proximate inverse Hessian matrix. In the computation of , we need only 
the elements (H^ N )ij for € F, and thus efficient computation will be 
possible even if the size of the matrix is large. Then, compute the sparse ma- 
trix Hf. + i € S satisfying the constraint (-£ffc+i)«j = (il? )ij for all € F. 
The calculation of -fffc+i from is regarded as the #y-projection with 
respect to the KL-divergence. The sparse clique-factorization technique 
|13[ [T5] is available for the practical computation of the projection. See 
[32] for details. 

For the computation of both and -fffc+i in the sparse quasi-Newton 

method, we can use Bregman divergence instead of the KL-divergence. Fig- 
ure U] shows an extended sparse quasi-Newton method for the approximate 
Hessian matrix B k . Figure [5] illustrates the geometrical interpretation of the 
extended sparse quasi-Newton updates. 

We have some choices in the algorithm of Figure HI (i) the Bregman 
divergence in Step 2, (ii) projection in Step 3, and (iii) the number of T. In 
the sparse quasi-Newton updates presented by Yamashita [32] , the number 
of iteration is set to T = 1; in Step 2, the standard BFGS/DFP method 
for the approximate inverse Hessian is used; in Step 3 the #y-projection 
defined from the KL-divergence is computed. Moreover, the superlinear 
convergence has been proved, see |32] for details. In the following, we present 
the geometrical interpretation of the sparse quasi-Newton method. Then we 
show a computation algorithm for the update formula derived from the V- 
Bregman divergence. 

5.1 Geometry of Sparse quasi-Newton update 

We consider the sparse quasi-Newton update formula from the geometrical 
viewpoint. Remember that A4 is the set of matrices satisfying the secant 
condition 

M = {Be PD(n) | Bs k = y k }. 
Below we consider two kinds of update formulae: 

Algorithm 1: In the algorithm in Figure U the matrix is defined as 
the 77-projection of onto M, that is, B" is equal to B DFP [B^; Sf., t^]. 
Then B^ t+1 ^ is defined as the ^-projection of B^> onto S. 
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Extended sparse quasi-Newton update algorithm: the Hes- 
sian approximation at the fc-th step of quasi-Newton 
method is updated to a sparse matrix B^+i- Suppose that 
the Bregman divergence is defined from the potential function 
and let S be the set of sparse matrix defined by a fixed index 
set F. 

Initialization: Let T be an positive integer, and B^ := B^. 
Repeat: t = 1,2,... ,T. 

1. Compute the partial matrix of- ^ for £ F from 

by using the extended quasi-Newton method such 
as CED or (fTT]). 

2. Compute the sparse matrix B^ G S which is the 0^- 
projection of onto S. 

Output: The updated approximate Hessian matrix B^+i is given 
as B^ € S. 

Figure 4: An extension of sparse quasi-Newton method is presented. The 
approximate Hessian Bk is updated to B^+i by exploiting the update for- 
mula with Bregman divergences. 



Algorithm 2: In the algorithm in Figure EJ the matrix is the 6^- 
projection of B^ onto A4, that is, B^ is given as the optimal solution 
of (|16|) . Then B^ t+1 ^ is defined as the ^-projection of B^ onto S. 

The difference between Algorithm 1 and Algorithm 2 is the projection onto 
A4 to obtain flw, Below we show the theoretical properties for each algo- 
rithm. 

In Algorithm 1, we consider how the Bregman divergence D^B^^B 1 ^) 
is updated. Let = B^ G <S and suppose that the ^-projection onto S 
exists. Then, the extended Pythagorean theorem in Section [2.31 leads that 

D V (B®,5®) = D V (B^,B^) + D v (B^ t+1 \B^) 

= D v {B W , £?( t+1 ) ) + D v (B^ , B^+V ) + D v (B^ ,B®) 
>D v (B^ t+1 \B^) 
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Figure 5: Geometrical illustration of the extended sparse quasi-Newton up- 
date algorithm. 

and hence we have 

D V (B^B^)>D V (B^\BW) >--->d^ t \b^). 

This indicates that under a mild assumption the Bregman divergence 
will converge to zero and that BW £ 5 will also converge to a matrix in 
A4 H«S. A condition on the convergence has been investigated by Bauschke, 
et al. [5j. This update algorithm is similar to the so-called em-algorithm 
[TJ [8] which is a popular algorithm in statistics and machine learning. In the 
em-algorithm, the 77-projection and the ^y-projection with V(z) = — log z 
is repeated in the probability space. Then, the maximum likelihood estima- 
tor under the partial observation is computed. In the context of statistical 
estimation, usually the em-algorithm is conducted when Jvi H S = holds. 
Under some assumption with KA PiS = 0, the point sequences (B^\ B«) € 
S x A4 converges to the pair of the closest point (B*,B*) € S x A4 such 
that (B* , B*) is the optimal solution of the optimization problem, 

min D V (B,B), 
(B,B)eSxM 

see |20j for details. We believe that to provide a simple characterization 
about the convergence point (B*,B*) under the condition M. n S 7^ is an 
open problem. 
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Next, we investigate Algorithm 2. Likewise we suppose 
Note that A4DS is 77-autoparallel. Let 5* be the ^-projection of B k = 
onto the intersection A4f]S. Then the extended Pythagorean theorem leads 
that 

D V {B\ B®) = D V {B\ flW) + D V (B®,B®) 

= D V {B*,B^) + D^{B^ 1 \B^) + D V {B^,B^) 
>D V (B\B^) 

and hence we have 

D V {B\ > D V (B*, B«) > > D^B*, B^). 

Suppose that JB^ converges to .Er 00 ) € .M D 5 when T tends to infinity, 
then the equality = B* holds as shown below. From the definition of 

B* and the extended Pythagorean theorem, we have 

d v {b^\b^) = d v (b(°°\b*) + d v (b*,bW). 

Due to the continuity of the Bregman divergence, for T — > oo we have 

= £> V (B(°°), B(°°>) = D V {B^\B*) + D V (B*,B^), 

and hence B^ 00 ) = £?* holds. As the result we have limy^oo £>( T ) = £?*. 
Figure [6] shows the geometrical illustration of the Algorithm 2. Applying 
Theorem 8.1 of Bauschke and Borwein [6], we see that the convergence of 
B^ to the point B* is guaranteed under the Bregman divergence associated 
with power potential with 7 < 0. The iterative update procedure is closely 
related to the boosting algorithm [12|. [22] in which the iterative Bregman 
projection is exploited to compute the estimator for classification problems. 

As argued above, it is not guaranteed that in Algorithm 1 con- 
verges to B*, which is the ^-projection of B^ = B^ onto A4 PI S. On the 
other hand the sequence B^' in Algorithm 2 converges to B* under mild 
assumption. From the viewpoint of the least-change principle, the sparse 
quasi-Newton method with Algorithm 2 will be preferable. Fletcher has 
proposed the sparse update formula using B*. The update formula using 
the matrix B* requires the sparsity and the secant condition simultaneously, 
and hence, the approximate Hessian can be ill-posed when (sk)i = for some 

i m- 
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Figure 6: Geometrical interpretation of Algorithm 2. The sparse matrix 
will converge to B* which is the 9^ projection of B^ = Bk € S. 

5.2 Computation of Projections 

We consider the computation of the extended sparse quasi-Newton updates. 
In Algorithm 1 and 2 above, we need to compute the (9^-projection of a 
matrix B onto the 77-autoparallel submanifold S consisting of sparse positive 
definite matrices. Generally the ^-projection does not have the explicit 
expression. Here, we study only the #y-projection based on the y-Bregman 
divergence. 

According to Yamashita [32] , we briefly introduce the computation of the 
projection onto S, when the geometrical structure is induced from the KL- 
divergence. For a given matrix € Ai, the projection onto S, denoted as 
B^ t+1 \ is obtained as the optimal solution of 

min KL( B.B®), s.t.BeS. 

BSPD(n) 

Some calculation yields that B^ t+l > is also the optimal solution of 
max detB" 1 , B.t. = (-ff w )y (i,j)€F. 

-BGPD(n) 

Let F be F = F\{(i, i) \i = l,...,n}. If the graph G = ({1, . . . , n}, F) is 
chordal, the existence of the optimal solution is guaranteed [32 ^ I13 |. IT5] . The 
inverse of the optimal solution, {B^ t+l ^)~ l , is represented by using the sparse 
clique- factorization formula [13} [32] , and then the updated inverse Hessian 
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matrix is obtained. The sparse clique-factorization formula of 1 is 

represented by 

= L J L T . . . L j iDLe i . . . L 2 Li 

in which L r (r = 1, ...,£ — 1) are lower triangular matrices, and D is a 
positive definite block-diagonal matrix consisting of I diagonal blocks. The 
number of £ is determined by the the number of maximal cliques of the 
graph G = ({1, . . . , n}, F), and all elements of L r (r = 1, . . . , t — 1) and D 
are explicitly computed from (H^)ij, € F. We generalize the above 
argument to the projection with the y-Bregman divergence. 

Theorem 7. Let F be F = F\{(i,i) \ i = 1, . . . ,n}, and suppose that the 
undirected graph ({1, . . . , n}, F) is chordal. Let € A4. Then there exists 
the Oy -projection of onto S, and the projection is the optimal solution 
of the following problem, 

min det(S), subject to (9v{B))ij = {Qy(B®))ij, (i,j) € F. (2 9) 

Proof. Remember that 9y(P) is defined as Oy{P) = — v{det P)P~ l which 
is a negative definite matrix. It is easy to see that the mapping —9y(P) 
is bijection on PD(n). Hence, the assumption on the graph ({1, . . . ,n},F) 
guarantees that the problem 

max det(-e v (B)), (9y(B))ij = (6y(B^))ij for all G F (30) 

-BgPD(n) 

has the unique optimal solution B* , and the optimal solution satisfies (— 8y(B*))~ 
S, as shown in [15 } I13 |. [32] . In terms of the objective function, we see that 

det(-e v (B)) = det(v(det B)B~ 1 ) = \ J . 

det B 

The function v(z) n / z is strictly monotone decreasing for z > 0. Indeed, 

d , u(z) n n( , 1\ 
■log-^- = - h3(*)-- <0 



dz z z\ n 



holds. Thus, the optimal solution of (|30j) is identical to that of (|29j) . We 
find that B* £ S holds, since {-9 V {B*))" 1 = u(det B*)~ l B* € S holds. For 
any B G 5, we have 

D v (B,B®)-D v (B,B*)-D v (B*,bV) = Y^{6 v {B^)-6 v {B*)) ij {B*-B) ij 

= 0. 
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The second and third equalities follows (6y(B^) — 6y(B*))ij = for € 
F and (B* — B)ij = for F, respectively. Therefore, B* is identical 
to the 0y-projection of Bt onto S. □ 

We present a practical method of computing the projection of B^ onto 
S. Let and for t = 0,1,2,... be matrices generated by the ex- 
tended sparse quasi-Newton update with Algorithm 2. We show a method 
of computing H® = (B^)" 1 and = (bW)" 1 . Suppose we have ifW, 
then JE/w is obtained by solving the problem 

min DviH- 1 ,^)- 1 ), Hy k = s k . 

H&T){n) 

In the similar way of the proof of Theorem HJ the optimal solution H^' 
satisfies 

(t) _ »(det(m)-i) DFP (t) / Kdgt(gW)^) W 

- Kdet(^W)-i)^ ^ .M.«*J + ^ K det(FW)-i)J S J^- 

We need only the elements (H^)ij for G i 7 and the determinant 

det(fl"^). If we have the Choleskey factorization or the sparse clique- 
factorization formula of we can obtain these values by simple com- 
putation. Then, the matrix _£f(* +1 ) is given as the optimal solution of 

min DviH^AH®)- 1 ), H- l eS. 

H&D(n) 

As shown in the proof of Theorem [TJ i/( t+1 ) is also the optimal solution of 
^max ) det(-^y(if~ 1 )), e v {H-% = for all € F 

LetX = -6>y((i? (t+1) ) _1 ) = z/(det(i?(* +1 ))- 1 )if(' +1 ), then the sparse clique- 
factorization formula provides the factorized expression of X based on the 
information of u(det(H^)~' 1 )H^ , € F. The determinant of X is 

easily computed by the sparse clique-factorization formula. Then, we solve 
the the following equation, 

detX = ^^, z>0. 

z 

The Newton method is available to find the unique solution z* efficiently 
Using the solution z*, the matrix H^ t+l > is represented 

= -j—X. 

u(z*) 

The matrix _ff"(* +1 ) also has the expression of the sparse clique-factorization 
formula, and thus, it is available to the sequel computation. 
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6 Concluding Remarks 



Along the line of the research stared by Fletcher [TO], we considered the 
quasi-Newton update formula based on the Bregman divergences, and pre- 
sented a geometrical interpretation of the Hessian update formulae. We 
studied the invariance property of the update formulae. The sparse quasi- 
Newton methods were also considered based on the information geometry. 
We show that the information geometry is useful tool not only to better 
understand the quasi-Newton methods but also to design new update for- 
mulae. 

As pointed out in Section [3l the self-scaling quasi-Newton method with 
the popular scaling parameter is out of the formulae derived from the Breg- 
man divergence. Nocedal and Yuan proved that the self-scaling quasi- 
Newton method with the popular scaling parameter has some drawbacks 
[24j . An interesting future work is to pursue the relation between the nu- 
merical properties and the geometrical structure behind the optimization 
algorithms. In the study of the interior point methods, it has been made 
clear that geometrical viewpoint is useful [28]. The geometrical viewpoint 
will become important to investigate algorithms for numerical computation. 
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A Proof of Theorems |4] 

We prove the following lemma which is useful to show the existence of the 
optimal solution. 

Lemma 8. Let V be a potential and v = vy . For any C > the equation 



has the unique solution. 

Proof. We define the function ((z) by C,{z) = logz — (n — l)log^(z), then, 
the (|3Tj) is equivalent to the equation 



Cu(z) 



n-l 



= z, z > 



(31) 



logC = C(z) 



z > 0. 



(32) 
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Since the potential function satisfies lim^+o z/u(z) n ^ 1 = from the defi- 
nition, we have lirn^+o C( z ) = — oo. In terms of the derivative of C( z )j we 
have the following inequality 

d , l , .p(z) l 
—C(z ) = -- n-1 ^> — >0. 

az z z zn 

Thus, £(z) is an increasing function on R + . Moreover we have 

1 logz 



c(^)>ca)+ r —dz=Q{\)+ 

Ji zn 



n 



The above inequality implies that lim 2 _j.oo £(z) = oo. Since Q(z) is continu- 
ous, the equation (|32p has the unique solution. □ 



Proof of Theorem^ First, we show the existence of the matrix B k+ \ satis- 
fying (I19p . Lemma [8] now shows that there exists a solution z* > for the 
equation 

^(detBfc)^ 1 v ; 

By using the solution z*, we define the matrix B such that 



u(detB k ) ' ' v(detB k )> s~lyk 

then the determinant of B satisfies 

^-^^■^-^ 

in which the first equality comes from the formula det(^4+vu T ) = det(A)(l+ 
u T A~ 1 v) and the second one follows the definition of z*. Hence there exists 
B k+1 G PD(n) satisfying (fT9|) . 

Next, we show that the matrix B k+ \ in (]19p satisfies the optimality 
condition of (|18p . According to Giiler, et al. [16], the normal vector for the 
affine subspace 

M = {B G PD(n) | Bs k = y k } 
is characterized by the form of 

s k X T + XsJ G Sym(re), A G M n . (33) 
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Suppose B' € PD(n) be an optimal solution of (|18p . then B' satisfies the 
optimality condition that there exists a vector A S M. n such that 

V B D v (B,B k )\ B=B , = s k X T + XsJ 
-z^(det( J B / ))( J B / )" 1 + ^(det^))^ 1 = s k X T + XsJ , 

where V BDv{B,B k ) denotes the gradient of Dv{B,B k ) with respect to 
the variable B. Also, the optimal solution B' should satisfy the constraint 
B's k = y k . On the other hand, the matrix B k+ \ defined by (|19p satisfies 

! ^(det^ fc ) / ryBFGS r p .„ „n-i, A ^(detff fc ) \s k sj 

B k+i - (a + p — \\ B [Bk,Sk,yk\) +1 77775 — V H= — 

v(detB k+1 ) V v{detB k+1 )J sly k 

v(detB k ) DFP x / i/(detBfc) A s fc sj 



Kdet5 fe+1 ) B [S * 5 yfc ' Sfc] + " KdetW 
-u{det B k+1 )B^l x + ^(det ^fc)^ 1 = s k X T + AsJ, 

z/(det5jfc) x ^(detSfc+i) i^(det B k )y],B7 l y k 
A _ T ^fc y k TTr Sfc 77~t — \2 Sfe " 

The conditions s^y/c > and B k € PD(n) guarantees the existence of the 
above vector A. In addition, the direct computation yields that the con- 
straint B k+ \s k = y k is satisfied. Hence, B k+ \ satisfies the optimality con- 
dition. Since (|18p is a strictly convex problem, B k+ i is the unique optimal 
solution. □ 
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